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Abstract. We consider Metropolis Hastings MCMC with tar- 
get tv (8) in cases where the log of the ratio of target distributions 
D = log(iv(8')/Tv(8)) is replaced by an estimator D(W). The esti- 
mator is based on m samples W = {W\, W2, W m ) from an inde- 
pendent online Monte Carlo simulation. Under some conditions on 
the distribution of D(W) the process resembles Metropolis Hast- 
ings MCMC with a randomized transition kernel. When this is 
the case there is a correction to the estimated acceptance proba- 
bility which ensures that the target distribution remains the equi- 
librium distribution. The simplest versions of the penalty method 
of Ceperley and Dewing 1999 [6], the universal algorithm of Ball et 
al. 2003 [3] and the single variable exchange algorithm of Murray 
et al. 2006 [TS] are special cases. In many applications of interest 
the correction terms cannot be computed. We consider approxi- 
mate versions of the algorithms. We show that on average O(m) of 
the samples realized by a simulation approximating a randomized 
chain of length n are exactly the same as those of a coupled (exact) 
randomized chain. Approximation biases Monte Carlo estimates 
with terms 0(l/m) or smaller. This should be compared to the 
Monte Carlo error which is 0(1/ y/n). 



1. Introduction 

Monte Carlo simulation offers a direct route to statistical inference 
for many otherwise awkward fitting problems. The class of problems 
which may be treated using Markov chain Monte Carlo (MCMC) and 
Sequential Monte Carlo has grown a great deal since the core algorithms 
were proposed [121 E]. One of the most important recent advances 
[TT} HI [21 d] has given us pseudo-marginal MCMC algorithms which are 
useful for some doubly intractable distributions. Such distributions are 
hard to simulate as we cannot readily compute the ratio of densities at 
two values of the target variable. Algorithms with a pseudo-marginal 
target distribution put an estimate for the target distribution in the 
Monte Carlo state along with the target variable. In this paper we 
give a class of MCMC algorithms characterized by a pseudo-marginal 
transition probability kernel. In these algorithms the Monte Carlo state 
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is just the target variable. The class as a whole seems to be unknown in 
statistical inference, though specific examples have appeared in various 
contexts in the physics literature. 

The MCMC algorithms we describe are, for the most part, exact only 
in the case that the ratio estimator has a known log-normal distribu- 
tion. By 'exact' we mean ergodic, with equilibrium distribution equal 
the desired target, not 'perfect' in the sense of Propp and Wilson 1996 
[TB] and 1998 [T7]. The restriction to log-normal estimators corresponds 
to the case where we have a normal estimator for the log-likelihood. We 
know of no real problem which presents this feature. However, a ratio 
estimator may be approximately log-normal when there is an appropri- 
ate CLT. We show that an inexact MCMC algorithm may be adjusted 
to yield, with high probability, just the same samples as a suitably cou- 
pled exact algorithm, in any simulation of fixed length. At any fixed 
precision of the overall Monte Carlo estimate the approximation error 
may be zero. 

The paper has five sections. In Section [2] we give a class of Metropolis 
Hastings MCMC algorithms with a pseudo-marginal transition kernel. 
This is randomized MCMC. In Section [3] we show that three existing 
MCMC algorithms for doubly intractable problems are special cases of 
the algorithm described in Section [2j In Section [4] we give a coupling- 
separation algorithm motivating the use of approximate MCMC. In 
Section[5]we give two very simple examples, chosen so that the coupling 
separation algorithm can be implemented exactly. We conclude with a 
brief discussion of the results in this paper. 



2. Randomized Metropolis Hastings algorithms 

2.1. Standard MCMC. We begin with the standard MCMC algo- 
rithm of Hastings 1970 [8] and Metropolis et al. 1955 [12]. We call this 
the s-algorithm (standard algorithm). Our notation follows Tierney 

Let 9 ~ n(d9) be a target variable with state space E, having a 
distribution n(d9) = n [9) fi(d9) which has a density tt(6) with respect 
to a measure fi(d9) defined for sets in a sigma- algebra £ of subsets of 
E. Let Q{9,d9') = q(9,9')/i(d9') be a Hastings proposal distribution 
with density q with respect to /i, satisfying q(9, 9') > -v4> q(9', 9) > 0. 
Let 

so that 

a(9,9') = min{l,/i(0,0')} 
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is the standard Metropolis Hastings acceptance probability. Let 

(2.1) P (0,8>) = { q(9 ' (r) ^ ff) Itff 

be the zeroed transition probability density in the s-algorithm. Let 

r(0) = 1 - ! p(9,6')fi(d9') 



je 

give the probability for a rejection, and let Ig=o' be the indicator func- 
tion for the event 6 = 6'. The transition probability distribution for 
s-MCMC is 

P(6, d&) = p(6, 0>(d0') + r(9)h=o'- 

2.2. Randomized MCMC. We now give algorithms with a random- 
ized acceptance probability. We call these algorithms r-algorithms. 

Let X be a real scalar random variable with probability density 
£(x;0, 0'). We assume the support W of £ is independent of 6 and 
6'. Let / : W — > W be an involution, i.e. / satisfies f(f(x)) = x. 
The involution / can be thought of as pairing points (x,f(x)) in W. 
We assume that / has a derivative at £ G W. Examples of 

suitable involutions are the trivial involution f(x) = x, and the family 
of functions 

x ax + b 
f( x ) = 

cx — a 

with a 2 + bc ^ 0. 
Let 

h,(6,6';x) = h(6 : 6') ^^ \f'(x)\ 

so that 

(2.2) a € (0, 6'; x) = min {1, h^(6, 6'; x)} 

gives an acceptance probability which is randomized by X. 
The r-algorithm is as follows. 

Algorithm 1 (r-MCMC). At state Qt = 6, simulate Qt+i as follows: 

(1) Simulate 6' ~ q(0, •) and x t ~ £(■; 0, 0')- 

(2) With probability a^(6,6';xt) set 0t+i = 0' and otherwise set 

et+i = 6. 

We now show that the r-algorithm simulates a transition kernel that 
is in detailed balance with ir, so that {Qt} t=0 12 is a Markov chain 
targeting ir. The acceptance probability in the r-algorithm is 

(2.3) a ^6,6')= [ £(x;0,0')a s (0,0';x)cfe 

Jw 
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and we wish to establish detailed balance, i.e., 

(2.4) tt(%(0, 9')a^(9, 9') = ir(9')q(9', 9)a^9\ 9). 
Multiplying both sides of Eqn [272] by n(9)q(9, 9')£(x; 9, 9') shows that 

(2.5) tt(%(0, 9')i{x] 9, 9')a^9, 9'; x) = 

min M9)q(9, 9')£(x; 9, 9'), rr(9')q(9', 0)£(/(ar); 9', 9)\f'(x)\} . 

Similarly, 

(2.6) n(9')q(9',9)ay,9',9)a^9',9;y) = 

min {ir(9')q(9', 0)£(y; 9', 9), n(9)q(9, 9')C(f(y); 9, 9')\f'(y)\} . 

In Eqn l2.6l set y = f\x) so x = f(y) and f'(y) = 1/ f'(x) by the inverse 
function theorem. It follows that 

(2.7) 7r(9')q(9', 0)£(/(x); 9\ 9)^(9', 9; f(x))\f'(x) \ = 

min {n{9')q{9\ *)£(/(*); 9\ 9)\f{x)\,it{9)q{9, 9')£(x; 0, 9')} 

and this equation has RHS equal to the RHS of Eqn 12.51 It follows 
that the LHS of Eqns [2751 and 12.71 are equal, i.e., 

(2.8) n(9)q(9,9')a^9,9';x)ax;9,9') = 

n(9>)q(9>, 9)^(9', 9- f(x))t(f(x); 9', 9)\f(x)\. 

This is 'very detailed balance'. Integrating Eqn l2.8l over all x in W gives 
detailed balance on average, Eqn 12.41 Besag et al 1995 [5] show that 
continuously indexed proposal distributions, each satisfying detailed 
balance on its own, can be used in MCMC In our setup the transition 
kernels satisfy detailed balance in pairs, the kernel at x with the kernel 
at f(x). 

In the next section we mention algorithms with more than one ran- 
domization. We claim without proof that the properties of r-MCMC 
established in this and the next sub-section hold under the general- 
ization from scalar to multivariate X. That is, if X = (X\, Xk) 
has multivariate density £(x; 9, 9') with support W in §l K , and / is an 
involution of W having Jacobian f'(x) with determinant \f ; (x)\, then 
the r-algorithm in Alg [JJ targets it. 

2.3. Properties of r-MCMC. In this section and Appendices A and 
B we show that r-MCMC is less statistically efficient than the s-MCMC 
from which it is derived, and give a sufficient condition for the r-chain 
to inherit 7r-irreducibility and minorization from the corresponding s- 
chain. These results can be used to establish ergodicity in particular 
cases. 
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First, r-MCMC is less statistically efficient than the s-MCMC. We 
show (in Appendix A) that, for all 0, 9' G E, 

(2.9) a^9,9') <a(9,9'). 

Since the two chains have the same proposal distribution Q(9, d9'), this 
puts the r-chain below the s-chain in the ordering of Peskun (1973) 
[16], and so r-chain estimators have greater asymptotic variance than 
corresponding s-chain estimators. 

We next give a sufficient condition for r-chain 7r-irreducibility and 
minorization. Let p%(9, 9') be given by Eqn l2.ll with a (6, 9') replaced by 
a ? (0, 9') (see Appendix B for details). Let r f (0) = l-j E p^9, 0')//(d0') 
and 

P 6 (9, d9') = P( :{9, 9')fi{d9') + r € (0) W- 
Suppose there exists a constant e > such that 

for X ~ £(•; 0, 0'), independent of and 0' in E. If £ does not depend on 
0, 0', as in the penalty method example [6] below, condition Eqn 12.101 
is automatically satisfied (see Appendix B). 

We show (in Appendix B) that if Eqn 12.101 holds then 

(2.11) P € (0,C) > eP(0,C). 

Because the s-chain is 7r-irreducible by assumption, for each C G £ 
such that 7r(C) > and for 7r-a.e. G E there exists n = n(9, A) 
such that PW(0, C) > 0. However, if EqnEHholds then P ? (n) (0, C) > 
e rt P( n )(0,C), so the r-chain is 7r-irreducible also. 

Suppose the s-chain satisfies a minorization condition M(m, (3, C, v). 
This means that there exist m > 1, /3 > 0, a set C £ £ and a probability 
measure ^ on £ such that z/(C) > and P( m '(x,B) > f3v(B) for all 
x G C and all P G £. It follows from Eqn 12. Ill that the r-chain satisfies 
a minorization condition M(m, e m /3, C, z/). If the s-chain is uniformly 
ergodic, then so is the r-chain. 

2.4. Example. We now give a simple example illustrating the r-algorithm. 
Suppose we have an s- algorithm targeting 7r(0) with symmetric pro- 
posal q(9, 9') = q(9', 0) and acceptance probability 



7r(0) 

We can randomize this acceptance with a normal density £(x; 0, 0') = 
N(x;a,b) having mean a = log(7r(0')/7r(0)) and variance 6 = 1, and 
use the identity involution f(x) = x, to get the following r-algorithm: 
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Algorithm 2 (Example). At state 0( = 9, simulate Qt+i as follows: 

(1) Simulate 9' ~ q(9, •) and z t ~ N(0, 1). Set 

zt = \og(ir(9')/ir(9)) + z t . 

(2) With probability 

set 0t+i = 6>', otherwise set Qt+i = 9. 

The example algorithm satisfies detailed balance with respect to tt. 
The correction term can be interpreted as kind of random tempering, 
since the random power flattens the target distribution when < 1 — 
2xt < 1. However this 'tempering' does not decrease the integrated 
autocorrelation time of the Markov chain. As shown above, the r- 
chain is dominated by the s-chain in the Peskun ordering, and hence 
the integrated autocorrelation time is not decreased. 

3. Known randomized Metropolis Hastings algorithms 

We now describe some existing MCMC algorithms for doubly in- 
tractable distributions and show that they are r-algorithms. From 
this point on, we assume for ease of exposition that the proposal den- 
sity is symmetric, q(9,9') = q(9',9), since the Hastings extensions are 
straightforward. Suppose 

D(9,9')=\og(n(9')/n(9)) 

is an intractable function of (9,9'). The acceptance probability in the 
s-algorithm for ir(9) is 

a(0,0') = min(l,exp( J D(0,0 / )))- 

Let Dg t Qi = Do,e'(W) be an estimator for D(9,9') computed from a 
collection of m random variables W = (Wi, W2, W m ). Estimator 
Dq^i has cdf G m (-; 9, 9') and density g m (-] 9, 9'). Let 

a 2 = lim var(y/mD g e'(W)), 

m— >oo 

so that the estimaor variance is asymptotically o~ 2 /m. For example, if 
the Wi are iid, and Dg t gi(W) = W, then cr 2 = var(Wi). 

We begin with a 'naive' incorrect algorithm which is not itself an 
r-algorithm. We refer to an algorithm as naive when an estimate is 
plugged into the Metropolis Hastings acceptance probability, without 
correction. The 'MCWM' algorithm in Andrieu and Roberts 2009 [2] 
is an algorithm in this class. Although inexact, the algorithm may be 
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useful. Beaumont 2003 jl] has shown that the approximation bias may 
be small, and the resulting (inexact) chain seems to have better mixing 
properties than the corresponding (exact) pseudo-marginal MCMC. 

Algorithm 3 (Naive algorithm). At state Q t = 9, simulate Qt+i as 
follows: 

(1) Simulate 9' ~ q(9, •) and an estimate x t ~ <7m("j 9, 9'). 

(2) With probability 

a N (0,0') = min{l,e xt } 
set f+ i = 9', otherwise set t+ i = 9. 

The naive algorithm does not in general target ir, and may not even 
have an equilibrium distribution. 

The penalty method of Ceperley and Dewing 1999 [BJ corrects the ac- 
ceptance probability in the naive algorithm. A variant of this algorithm 
is used by these and other authors to simulate Gibbs distributions of 
interest in physical chemistry. Following Ceperley & Dewing 1999 [BJ), 
we suppose a normal estimator De,e> ~ N(D(9, 9 ; ), a 2 (9, 9')/m) is avail- 
able. We assume a(9, 9') = a(9', 9) and write a = a (6, 9') below. 
Ceperley & Dewing 1999 [BJ show that the following algorithm, which 
they call the 'penalty method', targets ir(6) exactly. 

Algorithm 4 (Penalty method (Ceperley & Dewing 1999 [BJ)). At state 
Q t — 0) simulate Qt+i as follows: 

(1) Simulate 9' ~ q(9, •) and an estimate y t ~ N(D(9, 9'), a 2 fm). 

(2) With probability 

a P (9,9') = min{l,e ?/i - <T2/2m } 

set ©t+i = 9', otherwise set Qt+i = 9. 

Ceperley and Dewing 1999 [Bj show that detailed balance is satisfied 
by carrying out the integrals over y t needed to verify Eqn 12.41 We 
now show that the penalty method is an r-algorithm. Let X t have a 
normal density, 9, 9') = N(x; 0, cr 2 m) and take for / the involution 
f(x) = a 2 — x. It follows that 

and hence the acceptance probability in the r-algorithm at step t is 
a^9,9'-x t )mm{l,e D ^ +x ^ 2 / 2m }. 
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Since D(9, 9') + X t (in the r-algorithm) and Y t (in the penalty method 
in AlgEJ have the same distribution, the algorithms are equivalent and 
the results in Sec I2.2I prove that the penalty method targets ir. 

Ceperley and Dewing 1999 [6] give other more general but approxi- 
mate algorithms, the simplest of which is the Penalty Estimate method. 
We analyze this alongside the naive algorithm in Section HI Suppose we 
have a unbiased normal estimator for D, but do not know its variance, 
and so try replacing a 2 with the sample variance s 2 . 

Algorithm 5 (Penalty Estimate method (Ceperley & Dewing 1999 [6])). 
At state Qt = 9, simulate Ot+i as follows: 

(1) Simulate 9' ~ q(9,-), an estimate y t ~ N(D(9,9'),a 2 /m) and 
an independent variance estimate s 2 with 

(m — l)s\/a 2 ~ X 2 ( m ~~ !)• 

(2) With probability 



set Qt+i = 9', otherwise set Qt+i = 9. 

This algorithm is inexact. It is not an r-algorithm. 

Ball et al. 2003 [3] give a 'universal rule' which applies for sym- 
metric error distributions with compact support and does not require 
that the error variance a 2 be known. In the case where the error 
distribution is normal the algorithm can target only a very good ap- 
proximation to 7r. We can show that the universal rule for a nor- 
mal error distribution is an r-algorithm, with two randomizations, 
fi(ui;0,0') = N( Ul ; 0, a 2 /m) and £ 2 (u 2 ;0,0') = N(u 2 ; a 2 /m, a 2 /m) 
and the same involution, f(ui) = a 2 /m — Ui,i = 1,2, for each. In 
this bivariate randomization the unknown variance appears in a factor 
multiplying the whole acceptance probability. The acceptance proba- 
bility may be simulated by a form of rejection. Promising though it 
is, we do not discuss the universal algorithm here, as there is no ex- 
act algorithm for the normal case and we make no further use of the 
connection. 

The final algorithm we discuss is the single variable exchange algo- 
rithm of Murray & MacKay 2006 [15]. We present this in its original 
Bayesian setting, with data d = (d±, ...,d n ), prior p{9) and Likelihood 
L(9,d) = c(9)L(9,d). The target is now the posterior distribution, 



The likelihood has an intractable normalizing constant c(9). Murray & 
MacKay 2006 [15] arrange things so that this cancels in the Hastings 




ir(9\d))ocp(9)L(9,d). 
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ratio, developing an idea due to M0ller et al. 2004 [13]. However, while 
M0ller et al. 2004 [T3] augment the MCMC state with an auxiliary 
variable, the auxiliary variable in the exchange algorithm is associated 
with a single update, as in the r-algorithm. 

Algorithm 6 (Single Variable Exchange). At state 4 = 9, simulate 
Qt+i as follows: 

(1) Simulate 9' ~ q(9, •) and x t ~ L{ff , ■). 

(2) With probability 



Factors of c(9) and c(9') cancel. This is an r-algorithm with 9, 9') = 
L(9',x) and the identity involution f(x) = x. This is the first useful 
instance we have given for r-MCMC (ie, excluding the example in Sec- 
tion [2]4]) in which £ actually depends on 9' . 

How do these identifications of existing algorithms as r-algorithms 
help us? Both Ceperley & Dewing 1999 [B] and Ball et al. 2003 [3] treat 
detailed balance as an integral equation, integrated over the random 
variation introduced by the estimator, D. The acceptance probability 
is obtained as a solution of this integral equation. In fact the 'very 
detailed' balance relation Eqn 12.81 shows that the functions under the 
integrals are equal. We expect that this will help with the development 
of new algorithms. 

4. Separation times and approximate- target MCMC 

In this section we give a coupling strategy which motivates the use 
of the naive algorithm in some cases. We present and analyze a cou- 
pling algorithm to show that, on average, the naive algorithm gives 
exactly the same MCMC samples as the exact penalty method, out to 
0(m) steps of the naive chain, where m is the sample size used in D- 
estimation in the naive chain. The error in this algorithm is analyzed 
in Andrieu and Roberts 2009 [2] from a different perspective. 

Recall that Defi' is an estimator for D(9, 9') with cdf G m (-; 9, 9'). We 
do not assume Dq^ 1 is unbiased or normal. We do assume it satisfies a 
CLT, so that 




set 9 



t+i = 



9' and otherwise set 



t+i = 



9. 



(4.1) 
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with a 2 the asymptotic variance of y/mDg^i. For example, if Do,6' is 
computed from a realization of a geometrically ergodic Markov chain 
W = {Wi}Z and 

1 m 

(4.2) Dggi = — Wi, 

i=i 

then Eqn 14.11 holds, subject to mild additional conditions specified in 
Kontoyiannis and Meyn (2003) [9]. Results of this kind may be used 
with the delta method to get asymptotically normal ratio estimators. 
If Eqn 14. II holds for Dg gt, then it can be coupled to a normal estimator. 



£W = D 



a D-D 



m a/y/m 



D + 4=$ _1 (Gm0) + 0(m- 1/2 )) 



m 

(7 



(4.3) = J D + ^$- 1 (G m (D)) + 0(l/ 



where $ _1 (G m (-D)) is a standard normal random variable. 

We now give the coupling algorithm. In our example, we couple 
the naive and penalty method chains. Couplings of this kind may 
be applied to other pairs of algorithms. The algorithm simulates the 
penalty method and also an indicator variable B t 6 {0, l},t = 1,2, ... 
marking the times at which the naive chain separates from the penalty- 
method chain. 

Algorithm 7 (Coupling algorithm: penalty method and naive algo- 
rithm). At state Q t = 8, simulate B t and Qt+i as follows: 

(1) Simulate 6' ~ q(9, •) and an estimate x t ~ g m {-] 0,9'), and set 

y t = D + ^=^\G m {x t )). 
Jm 



(2) Simulate V t ~ U(0, 1). Let 

ap(^,^^)=min{l,e^ 2 / 2m }. 

If Vt < ap then set Qt+i = 0', otherwise set Qt+i 

(3) Let 

a N (6, 6'; x t ) = min {1, e Xt } . 

If 

min(aN, Qp) < V t < max(a N , q p ) 
then set B t = 1 and otherwise set B t = 0. 
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The Gi chain targets it exactly, as it is a penalty method chain. The 
two chains (naive and penalty method) separate when V t falls between 
the two acceptance probabilities, since then the chains make different 
accept/reject decisions. If this first separation time is larger than the 
run length, then the naive algorithm realizes the same samples as the 
penalty method, and the error from using an inexact chain is unde- 
tectable at the overall MCMC precision. We emphasize that we can- 
not usually implement the coupling algorithm as we cannot in general 

compute ^(Grr^Defii))- 

We now give a lower bound on the mean time to separation, assuming 
that the chains start in equilibrium. Let T = min{t > 1; B t _i = 1} 
be the first passage time to separation. We assume Pr(T < oo) = 1. 
Next, condition the process on a separation at the first trial, Bq = 1, 
set T = and let 

Ti = min{t > 7U; B t = 1, % = 1, 2, 3, ...\B = 1} 

be the sequence of separation return times. The intervals — Tj_i 
are the random intervals between separation events. The B t process is 
not in general a Markov chain, and the intervals are not iid. However, 
conditional on B = 1, the B t process is a stationary discrete process, 
with interval-stationary separation return times. Let p = E(7\) be 
the mean separation return-time. By Kac's Recurrence Theorem for a 
stationary discrete process, p = l/Pr(S = 1). Let 

E\a P -a N \ = [ \a P (8,8 , ;6)-a N (8,0';5)\g m (5)d5Q(9,d9 , )Tr(d9). 

JE 2 xR 

The mean separation return-time is 



The separation time from the initialization O = #0 is t(9 ) = E(T|B = 
6> ). Let r = J E r(9)7i(d9) give the mean separation time starting in 
equilibrium. This time is the expected first passage time to a sep- 
aration event B t = 1 from an equilibrium start for O . The return 
time of an interval-stationary process bounds the first passage time by 
2r > p, as the return time around a fixed time is length-biased, and the 
fixed time is uniformly distributed in an interval between separation 
events. It follows that the mean separation time r grows at least lin- 
early with increasing m, since \ap— ccjv| = 0(1 jm) in Eqn l4.4l We have 
not bounded the separation time t{6q) from any particular start state. 
However we assume t(9q) ~ r as the the event ap = ot^ = 1 occurs 
more frequently during convergence, and the chains cannot separate on 
these events. 
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If we couple the naive algorithm to the standard algorithm, s-MCMC, 
with acceptance probability a, we find E|a — (Xn\ — 0(1/ \/m). The 
naive algorithm is therefore 'closer' to the exact penalty method, in 
the 'distance' E|«p — than it is to the exact standard algorithm. 
We improve the naive approximation to the target by improving the 

approximation G m (x) ~ $ \ j^j== \ ■ 

The Penalty Estimate method (Algorithm 5) has separation times of 
0(m 3 / 2 ) at the price of stronger conditions on the distribution of the 
estimator Dg^i. The estimator xt ~ N(D(6,d'),a 2 /m) in the Penalty 
Estimate method is exactly normal, but we do not know <r 2 , and pro- 
ceed as in Algorithm 5. The estimator is normal, so it can be coupled 
to the (exact) Penalty Method using the trivial coupling y t = x t . Now 
since 

(m-iy/^-(m-l) n 
y/2(m-l) 

the two acceptance probabilities ap and ap differ by terms of 0(m 3 ^ 2 ). 
We give an example with this behavior in Section [51 

Finally we note that if the chains are independence samplers, the 
naive chain and the penalty method chain may separate and then coa- 
lesce. This kind of coupling is used in the 'perfect simulation' algorithm 
of Murdoch et al. 1998 [14] . The two chains are offered the same can- 
didate at each step, even after separation, and they coalesce when they 
both accept. If the time to coalesce is much smaller than the time 
to separate, the naive algorithm may give a very good approximation 
indeed. 

5. Examples 

In order to show the linear dependence of the separation time r on m, 
we give a very simple example for which we can compute ^~ 1 (G m (x)). 
Let 7r be an equal mixture of bivariate normals, with 9 = (9±, 82) and 

7,(9) = l - MVN(6- (A lt EO + l -MVN(6- ^ E 2 ), 

Hi = (3,3) T , ii 2 = (6,6) T , [E n ] M = 1 for a,i = 1,2, and [Ei]i, 2 = 1/2 
and [E 2 ]i, 2 = -0.5. If D = log(7r(0')M0)) and for % = l,2,...,m, 
Wi ~ Exp(l) then we use D = D — 1 + rn/^2 i Wi to estimate D. 
In this example G m (x) = F((x + l)/m;m, 1) where F(-;m, 1) is the 
cdf of an inverse Gamma variable with shape M and rate 1, so that 
E(D) = D + 1/(M - 1) and var(L>) = m 2 /(m - l) 2 (m - 2). In this 
example, D is a biased, non-normal estimator for D. 

In Fig [1] we plot realizations of coupled Penalty Method and naive 
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50 100 150 200 



MCMC update 

Figure 1. Simulations of the MCMC coupling- 
separation algorithm Oi + 02 in the Penalty Method 
(solid lines) and Naive Algorithm (dashed lines), (Top) 
with Random-Walk Metropolis updates and (bottom) 
with Independence- Sampler updates. Target density is a 
mixture of bivariate normals, D-estimator using m = 8 
samples at each update. 
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MCMC samplers using random walk Metropolis, and an Independence 
sampler. The means of the mixture components are on 9\ = 62 so 
we plot ©! + 2 against MCMC update to show the mixing across 
components. For illustration, we compute our estimator D from just 
eight independent WiS (we set m — 8). This ensures that the mean 
separation times are short, approximately 72 updates for the random 
walk update and 32 for the independence sampler. This is convenient 
for graphical display. In this simulation the random walk proposal (top) 
for the Penalty Method and naive samplers separate after 25 updates. 
The walks shadow one another as they use the same uniform numbers 
to generate proposals. The Independence samplers separate after just 
one update, but then coalesce, and go on to branch and coalesce many 
times. In an Independence sampler run of length 10000 updates with 
m = 8, 90% of the samples returned by the naive sampler are exactly 
equal to those returned by the Penalty Method. 

In Fig [2] we show the slight over-dispersion of the naive samplers 
compared to the target distribution. This is visible because we used a 
very small sample size for the .D-estimator D of m = 8 iid samples. In 
this example, the Penalty method samplers are exact, even with m = 8, 
and follow the target density to within error. 

The upper graph in Fig [3] demonstrates the linear dependence of the 
separation return times p and separation times r on the number of 
samples m used to form D. Two estimates of p are computed. The 
first is from Eqn 14.41 

A _ 1 

Pl ~ K' 1 ££1 \<* P {6 t , 6[- y t ) - a N (9 t , 6[- x t ) | ' 

For the second, p 2 = S~ l Yli=i(^i~ ^i-i) estimates p. The p± estimator 
has lower variance than the f>2 estimator. The r-estimator is the mean 
of 1000 realizations of T. The estimates are computed for Random- 
Walk Metropolis and Independence Samplers. The Naive Independence 
sampler separates more rapidly than the Naive Random- Walk sampler, 
on average, but can coalesce, so it generates estimates of similar bias in 
this example. The separation return times p and separation times r are 
approximately equal. The separation mark process B t is not in general 
Markov so K(T\Bq = 1) ^ E(T) in general. However in this example 
the separation return times Tj — Tj_i are not easily distinguished from 
iid geometric random numbers. 

In order to demonstrate the 0(m 3 / 2 ) behavior of separation times 
between coupled Penalty Method and Penalty Estimate Method chains 
we take W { ~ iV(0, 1) D = D + Y.i W i/ m an use s 2 (D u D m ) to 



RANDOMIZED MCMC AND COUPLED MCMC 



15 




5 10 15 



ThetaJ +Theta_2 

Figure 2. Density for X + G 2 in the target mixture 
of bivariate normals (solid curve) plotted with densities 
estimated using the Penalty Method (dashed curves for 
Independence and Random Walk updates) and Naive Al- 
gorithms (dotted curves). 

estimate a 2 . The estimate is unbiased and normal, but the variance 
is unknown. The separation time function r = r(m) displayed in the 
lower graph in Fig [3] grows very rapidly with m. 

6. Discussion 

We have presented two new algorithms. The first is an 'exact' 
MCMC algorithm for a randomized acceptance probability, in which 
the randomization is not part of the MCMC state. As this algorithm 
generalizes the simplest forms of of the penalty method, universal algo- 
rithm and single variable exchange algorithms, it may have applications 
in the simulation of doubly intractable target distributions. 

Randomized MCMC is complementary to pseudo-marginal MCMC 
(Lin et al. (2000) [IT], Beaumont (2003) 0], Andrieu et al (2009) 0) 
and Andrieu et al (2010) [JJ. In both algorithms an intractable target 
distribution is estimated using auxiliary random variables. However, 
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Figure 3. (Top) Estimated separation times p and 
r between the exact Penalty Method and the approxi- 
mate Naive Algorithm, as a function of estimator sample 
size m, for Random- Walk (solid lines) and Independence 
sampler (dashed lines) updates. Two estimates of p, p\ 
(left error bar in each group of three) and P2 (central 
error bar) and f (right error bar) are plotted for each 
sample and each m with a linear regression of the p2 es- 
timates. (Bottom) Estimated separation times, as above, 
between Penalty Method and approximate Penalty Esti- 
mate chains regressed with pi = cm 3 / 2 . 
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in randomized MCMC algorithms the auxiliary variables extend the 
transition kernel, and it is the transition kernel which has the correct 
marginal at each update (ie, integrating Eqn I2.8[) . whereas in pseudo- 
marginal algorithms, the auxiliary variables extend the target distribu- 
tion, and the target distribution has the correct marginal distribution. 
Randomized MCMC does not maintain auxiliary variables in the state, 
and this may be an advantage. Beaumont (2003) jl] notes that pseudo- 
marginal algorithms are prone to getting stuck when the target density 
is over-estimated. Simulation studies using very simple target densities 
and comparable exact pseudo-marginal and penalty method algorithm 
show the efficiency advantage of the Penalty Method increasing with 
increasing estimator variance. Against this, existing exact randomized 
algorithms impose much stronger conditions on the distribution of the 
estimator than pseudo-marginal algorithms. 

In fact, the class of randomized Metropolis Hastings MCMC algo- 
rithms we have described are themselves special cases of a very large 
class of MCMC algorithms called Active Particle Algorithms, described 
in Lee et al. (2011) [1 0J . This class of algorithms includes the pseudo- 
marginal MCMC and randomized algorithms as special well as 
mixed algorithms in which both the target distribution and transition 
kernel in detailed balance are both marginals. 

The second algorithm, the coupling-separation algorithm, is not di- 
rectly useful for doubly intractable problems, as the simple form of the 
exact penalty method algorithm is not tractable. However, it shows 
that under some conditions the naive algorithm generates the exact 
same sequence of samples as an exact penalty method algorithm, out to 
0(m) steps in the MCMC, where m is the sample size used to estimate 
the log of the Hastings ratio. This suggests strategies for improving 
naive simulation, using any method that tends to increase this sepa- 
ration time. The coupling-separation algorithm may help to suggest 
improvements in other settings involving an approximate likelihood in 
an MCMC algorithm. As an example, if we have a very large data 
set, and a log-likelihood given as sum over independent data, we may 
estimate the log-likelihood using a small sample of size m drawn with 
replacement from the data. The estimator is asymptotically normal in 
m. 

The coupling-separation algorithm complements the perfect simula- 
tion algorithm of Propp and Wilson. In that algorithm coupled MCMC 
chains for the same target start in different states and coalesce. In the 
coupling-separation algorithm coupled MCMC chains for different tar- 
gets start in the same state and branch. 
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Appendix A: Peskun ordering of s-chains and r-chains. We now 

prove the ordering given in Eqn 12.111 Let 

a = {x e w : £&o,er) < h(e,e')c(f(x) ; e',e)\f'(x)\}, 

so that 



(6.1) a^e,e')= / £{x-,e,ff)dx+ / h(e,e')am;e',e)\f(x)\dx. 

J A JW\A 

Replacing 9, 9') with h(9, 9')£(f(x); 9',9)\f(x)\ for x in A, 

as(9,9') < [ h(9,9')af(x);9',9)\f(x)\dx 
Jw 

= K9,9'), 

while replacing h(6, 9')£(f(x); 6', 9)\f(x)\ with £(x; 6, 6') for x in W\A, 

at{9,9') < [ £(x;9,9')dx 
Jw 
= I. 

It follows that az(6,9') < min(l, /i(0, 0'))- 

Appendix B: irreducibility and minorization of r-chains. We 

now prove that Eqn 12.111 follows from Eqn 12.101 Let 

4X) " ^9,9') lf{X)l 

so that a^(9, 9'; X) = min(l, h(9, 9')E(X)) and E depends on 9 and 9'. 
From the definition of a^(9, 9') in Eqn 12. 3[ 

(6.2) a^9,9') = E{a^9,9';X)) 

= E (a c (0, 9'- X)\E{X) > 1) Pr(E(X) > 1) + 

E(a^(9,9';X)\E(X) < l)Pr(S(X) < 1) 
> E(a(9,9')\E(X)>l)PT(E(X)>l) 
= a(0, 0') Pr(S(X) > 1) 

where the inequality in the third equation holds because a^(9, 9'\ X) > 
a(9,9 r ) given E(X) > I. 
Let 



(6.3) Pt(0,0') = { 



q{9,9')a^{9,9') 9^9' 
9 = 9' 



be the zeroed transition probability density in the r-algorithm. The 
condition in Eqn 12.101 gives Pr(S(X) > 1) > e for some e > not 
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depending on 9, 9', and so p^(9, 9') > ep(9, 9') for the transition density, 
from Eqns [2H O and E3J Let 



give the probability to remain in the same state, in the r-algorithm. 
By the Peskun ordering, r^(9) > r(9), and so trivially, r^(9) > er(9), 
and hence 



for each 9 G E and B e £. This is Eqn 12. Ill and we are done. 

If £ does not depend on 9, 9' then condition Eqn 12.101 is not needed. 
In that case < Pr(E(X) > 1) < 1, since if 



for £-a.e. x £ W, then the two functions cannot both be probability 
densities, which is a contradiction. This implies Eqn 12.101 as there is 
no 9, 9'- dependence in Pr(S(X) > 1). If there is 9' dependence, as in 
the single variable exchange algorithm in Section [3j then we need to 
exclude cases where Pr(S(X) > 1) — > as 9' approaches a boundary 
of E, in order that Eqn 12.111 hold for e independent of 9 and C. 
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